Intestinal Metagenomics

Provided with fastq.gz files, I need to establish the spatial structuring of microbial communities using three biopsies along an intestine.

Broad questions:

  • Are any microbes biopsy-specific?
  • Do biopsies have high microbial community overlap, or are there tissue-specific communities?

Specific questions:

  1. How was the sequencing run? Anything to change if we could go back in time?
  2. What taxonomic groups are present in the samples?
  3. Which samples contain microbes that can produce colibactin?
  4. Do some exploratory data analysis. Make some figures and do some stats and relate it back to the broad questions.

Work will entail a combination of bash, python, and R. Since I am working on this myself, I will track the workflow in markdown. I find this the easiest way to track sequential workflows, allowing minor iterative adjustments throughout the project.

Report

Metagenomics has provided valuable insights into the differences in microbial species along the gastrointestinal (GI) tract and their role in various health conditions, including cancer and liver disease (Delik et al., 2022; Kharofa et al., 2023; Mannion et al., 2023). While many microbiome studies rely on 16S rRNA gene sequencing or fecal sampling for accessibility, the use of shotgun metagenomics offers a genome-resolved understanding of microbial communities directly from tissue biopsies (Cheng et al., 2022). Here, I briefly analyze shotgun reads from rectum and terminal ileum biopsies in three patients, potentially enriched using microbial-enrichment methods (Wu-Woods et al., 2023), to identify any biopsy-specific microbial species capable of producing colibactin, a genotoxic molecule implicated in cancer (Vizcaino & Crawford, 2015).

Methods

Quality control

A total of 18 libraries were analyzed, encompassing three patients, two tissues (rectum, terminal ileum), and three replicates. Interleaved paired-end reads were separated into forward and reverse files and analyzed independently with FastQC v0.12.1 (Andrews, 2010) and compiled using MultiQC v1.19 (Ewels et al., 2016). Reads were filtered using Knead Data v0.12.0 with default settings, which trims reads using trimmomatic v0.39 (Bolger et al., 2014), removes repeats with trf v4.09 (Benson, 1999), and identifies host (human) and common contaminants. Processed reads were then processed with FastQC and MultiQC. Reads counts from raw and filtered steps were summarized with python and visualized in R v4.3.2 with the tidyverse v2.0.0 (R Core Team, 2017; Wickham et al., 2019).

Community profiling

Filtered metagenomic reads were analyzed with MetaPhlAn v4.1.0 (Blanco-Miguez et al., 2022) and visualized with hclust2 (https://github.com/SegataLab/hclust2). Profiling was performed on three data subsets to assess sensitivity and overlap of species across various filters: 1) filtered reads at the technical replicate level (n = 18 samples); 2) filtered reads merged at the biological replicate level (n = 6 samples); 3) raw unfiltered reads merged at the biological replicate level (n = 6 samples).

Microbial species producing colibactin

All potential microbial species identified from community profiling were investigated to determine if they were capable of producing colibactin. Colibactin was first identified as being produced by Escherichia coli, leveraging a 54-kb gene cluster (Nougayrède et al., 2006), with additional evidence identifying this gene cluster across broader taxa (Kawanishi et al., 2020). I used two approaches to determine if any of our candidate species harbor genes in the NRPS and PKS enzymatic pathway involved with colibactin production (Nougayrède et al., 2006). First, I downloaded all RefSeq whole genomes for each candidate species and blasted the 64 clb gene primers against each genome using BLASTn v2.15.0+ (Camacho et al., 2009). Based on interspecific distance I used a relaxed word size of 15 and an evalue of 0.05. Secondly, I used antismash v7.1.0 (Blin et al., 2021) to predict secondary metabolite biosynthesis gene clusters using the RefSeq genomes, using prodigal for prediction (Hyatt et al., 2010). Pathways were extracted from each run and summarized in R.

Results & Discussion

Sequencing runs were somewhat successful despite high adapter content, overrepresented ‘GGGG…’ sequences, and highly fragmented libraries (Fig. 1). The initial 10 bases of the reads exhibited strange base composition, potentially from library preparation protocols. Quality control sufficiently ameliorated adapter and overrepresented sequence issues (Fig. 1). Overall, output per library was highly variable, making biological inference suspect due to variable amount of sequence across biological replicates (Figs. S1 & S2).

A maximum of 11 microbial species were identified across the libraries, with only five of those species identified across both technical and biological replicates (Fig. 2). Only a single individual (P2) exhibited consistent microbial communities, indicating that this sample was likely a target of microbial enrichment, while the remaining two individuals (P1 and P3) were not. Within individual P2, the species Mediterraneibacter faecis, Dialister invisus, and Bacteroides uniformis were only identified in the terminal ileum, while only Ruminococcus bromii was identified in the rectum. Alistipes putredinis, Phocaeicola vulgatus, and Akkermansia municiphila were found in both biopsies. Only one of these species was identified as potential colibactin-producing candidates, including Ruminococcus bromii in the rectum (Fig. 3). While Phocaeicola massiliensis was identified as a potential colibactin-producing candidate from the profiles using raw and unfiltered reads, only Ruminococcus bromii was identified from filtered reads (Fig. 2).

Future bioinformatic directions could assess specific gene pathways within the sequenced reads themselves using HuManN, assemble metagenome-assembled genomes if provided with additional reads using anvi’o, or replicate analyses with QIIME2. Overall, additional reads are required across replicates to ensure biological reproducibility.

References

Andrews, S. (2010). FastQC: A quality control tool for high throughput sequence data. Available at https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (p. Online) [Computer software].

Benson, G. (1999). Tandem repeats finder: A program to analyze DNA sequences. Nucleic Acids Research, 27(2), 573–580. https://doi.org/10.1093/nar/27.2.573

Blanco-Miguez, A., Beghini, F., Cumbo, F., McIver, L. J., Thompson, K. N., Zolfo, M., Manghi, P., Dubois, L., Huang, K. D., Thomas, A. M., Piccinno, G., Piperni, E., Punčochář, M., Valles-Colomer, M., Tett, A., Giordano, F., Davies, R., Wolf, J., Berry, S. E., … Segata, N. (2022). Extending and improving metagenomic taxonomic profiling with uncharacterized species with MetaPhlAn 4. https://doi.org/10.1101/2022.08.22.504593

Blin, K., Shaw, S., Kloosterman, A. M., Charlop-Powers, Z., van Wezel, G. P., Medema, M. H., & Weber, T. (2021). antiSMASH 6.0: Improving cluster detection and comparison capabilities. Nucleic Acids Research, 49(W1), W29–W35. https://doi.org/10.1093/nar/gkab335

Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), 2114–2120. https://doi.org/10.1093/bioinformatics/btu170

Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., & Madden, T. L. (2009). BLAST+: Architecture and applications. BMC Bioinformatics, 10(1), 421. https://doi.org/10.1186/1471-2105-10-421

Cheng, W. Y., Liu, W.-X., Ding, Y., Wang, G., Shi, Y., Chu, E. S. H., Wong, S., Sung, J. J. Y., & Yu, J. (2022). High Sensitivity of Shotgun Metagenomic Sequencing in Colon Tissue Biopsy by Host DNA Depletion. Genomics, Proteomics & Bioinformatics, S167202292200119X. https://doi.org/10.1016/j.gpb.2022.09.003

Delik, A., Dinçer, S., Ülger, Y., Akkız, H., & Karaoğullarından, Ü. (2022). Metagenomic identification of gut microbiota distribution on the colonic mucosal biopsy samples in patients with non-alcoholic fatty liver disease. Gene, 833, 146587. https://doi.org/10.1016/j.gene.2022.146587

Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047–3048. https://doi.org/10.1093/bioinformatics/btw354

Hyatt, D., Chen, G.-L., LoCascio, P. F., Land, M. L., Larimer, F. W., & Hauser, L. J. (2010). Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11(1), 119. https://doi.org/10.1186/1471-2105-11-119

Kawanishi, M., Shimohara, C., Oda, Y., Hisatomi, Y., Tsunematsu, Y., Sato, M., Hirayama, Y., Miyoshi, N., Iwashita, Y., Yoshikawa, Y., Sugimura, H., Mutoh, M., Ishikawa, H., Wakabayashi, K., Yagi, T., & Watanabe, K. (2020). Genotyping of a gene cluster for production of colibactin and in vitro genotoxicity analysis of Escherichia coli strains obtained from the Japan Collection of Microorganisms. Genes and Environment, 42(1), 12. https://doi.org/10.1186/s41021-020-00149-z

Kharofa, J., Haslam, D., Wilkinson, R., Weiss, A., Patel, S., Wang, K., Esslinger, H., Olowokure, O., Sohal, D., Wilson, G., Ahmad, S., & Apewokin, S. (2023). Analysis of the fecal metagenome in long‐term survivors of pancreas cancer. Cancer, 129(13), 1986–1994. https://doi.org/10.1002/cncr.34748

Mannion, A., Sheh, A., Shen, Z., Dzink-Fox, J., Piazuelo, M. B., Wilson, K. T., Peek, R., & Fox, J. G. (2023). Shotgun Metagenomics of Gastric Biopsies Reveals Compositional and Functional Microbiome Shifts in High- and Low-Gastric-Cancer-Risk Populations from Colombia, South America. Gut Microbes, 15(1), 2186677. https://doi.org/10.1080/19490976.2023.2186677

Nougayrède, J.-P., Homburg, S., Taieb, F., Boury, M., Brzuszkiewicz, E., Gottschalk, G., Buchrieser, C., Hacker, J., Dobrindt, U., & Oswald, E. (2006). Escherichia coli Induces DNA Double-Strand Breaks in Eukaryotic Cells. Science, 313(5788), 848–851. https://doi.org/10.1126/science.1127059

R Core Team. (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. (p. Online) [Computer software].

Vizcaino, M. I., & Crawford, J. M. (2015). The colibactin warhead crosslinks DNA. Nature Chemistry, 7(5), 411–417. https://doi.org/10.1038/nchem.2221

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T., Miller, E., Bache, S., Müller, K., Ooms, J., Robinson, D., Seidel, D., Spinu, V., … Yutani, H. (2019). Welcome to the Tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wu-Woods, N. J., Barlow, J. T., Trigodet, F., Shaw, D. G., Romano, A. E., Jabri, B., Eren, A. M., & Ismagilov, R. F. (2023). Microbial-enrichment method enables high-throughput metagenomic characterization from host-rich samples. Nature Methods, 20(11), 1672–1682. https://doi.org/10.1038/s41592-023-02025-4

Time Allocation

Time_Spent
Time_Spent

Time spent on various tasks.

Baseline project structure

Baseline structure and software installs. Working in UNIX here to get started. While I don’t have much explicit experience with microbial genomics, it seems like anvi’o is a program that can adequately address most components of this task, but I will model most of my work on the methods from Natalie & Jacob’s natmeth paper to maintain lab continuity.

First, I will set up a conda environment for this project. I will use mamba (a wrapper for conda) because it is way faster for installs and dependency resolving:

# version control
mamba --version
mamba 1.5.8
conda 24.1.2

# create my mamba env, using 
mamba create -n microbes python=3.10
mamba activate microbes

# install the massive dependencies - tidyverse alone usually takes like 15 mins, so run it and switch tasks 
mamba install -y -c conda-forge -c bioconda python=3.10 \
        sqlite prodigal idba mcl muscle=3.8.1551 famsa hmmer diamond \
        blast megahit spades bowtie2 bwa graphviz "samtools>=1.9" \
        trimal iqtree trnascan-se fasttree vmatch r-base r-tidyverse \
        r-optparse r-stringi r-magrittr bioconductor-qvalue meme ghostscript \
        r-rcolorbrewer fastqc multiqc trimmomatic bbmap trf seqkit metaphlan #match lab colors
mamba install -y -c bioconda fastani

# test anvio, set custom tmp directory because my cluster has a super long path which messes with python 
TMPDIR=/dss/dsshome1/lxc07/di39dux/microbe/tmp

# seems like the lab uses kneaddata for contaminant removal, so get that set up 
pip install kneaddata

# grab human database
### ... total nightmare. My cluster doesn't allow these sorts of xfers without certificates, have to scp local and re-upload
kneaddata_database --download human_genome bowtie2 /dss/dsslegfs01/pr53da/pr53da-dss-0021/projects/2021__Cuckoo_Resequencing/tmp

# and finally, get biobakery setup, targeting HUMAnN v.3.0, this seems to require python 2.7 (>2.7 and <2.8), so it needs its own environm,ent 
mamba config --add channels defaults
mamba config --add channels bioconda
mamba config --add channels conda-forge
mamba config --add channels biobakery
mamba create -n humann2 -c biobakery humann #version 3.9 

QC [Q1]

Initial Check

Run fastqc:

# fastQC on all files
fastqc *.fastq.gz --outdir=./fastqc_results

Compile with multiqc:

multiqc ./fastqc_results -o ./multiqc_report

Pertaining to:

Was this a successful sequencing run? What metrics did you use to determine this? Is there anything you would change if you could re-sequence again?

Initial thoughts:

  • Ileum biospy from P3 likely too many PCR cycles, high duplicates

  • Systemic read count biases among samples will make comparisons difficult, e.g. All P3 rectum have low reads, all P1 ileum have high reads

  • Base composition within the first ~10 bases is problematic. This persists after adapter trimming, suggesting this might be due to the technical specifications of library prep (e.g. enzymatic digestion .. ?).

    • Likely due to tagmentation, see here
  • There’s a severely over-represented sequence GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG in 16/18 samples. Maybe created in silico as a technical artifact trap, otherwise that’s a pretty strange adapter. Nextera transposase adapters detected throughout.

  • There is very high variability in the adapter dimers between replicates - some replicate inserts appear severely fragmented and are only ~25bp of sequence followed by 65bp of adapter.

  • Bad base quality, systemically related to sequencing tile. I find this pretty typical in many datasets not really a concern since trimming takes care of it.

Let’s trim and decontaminate and check again.

Trimming + Decontaminating

First, just split the reads into R1 and R2:

output_dir=/dss/dsshome1/lxc07/di39dux/merondun/microbes/raw_fastq/split
mkdir $output_dir

#loop through each interleaved FASTQ file
for file in raw_fastq/*-interleaved.fastq.gz; do
    # Extract the base name without the '-interleaved.fastq.gz' part
    base_name=$(basename "$file" "-interleaved.fastq.gz")
    
    # Define output files
    r1_output="$output_dir/${base_name}_R1.fastq.gz"
    r2_output="$output_dir/${base_name}_R2.fastq.gz"
    
    # Split interleaved FASTQ into R1 and R2
    reformat.sh in="$file" out1="$r1_output" out2="$r2_output"
    
    echo "Processed $file"
done

Run fastqc:

# fastQC on all files
fastqc *.fastq.gz --outdir=./fastqc_results

Compile with multiqc:

multiqc ./fastqc_results -o ./multiqc_report

Remove host DNA (human) and common contaminants and trim adapters, kneaddata default selects nextera which is fine.

Note that trimmomatic, if using conda, can cause headaches, see here.

#directory with raw data
raw=/dss/dsshome1/lxc07/di39dux/merondun/microbes/raw_fastq/split
trimmed=/dss/dsshome1/lxc07/di39dux/merondun/microbes/trimmed_fastq
ref_db=/dss/dsslegfs01/pr53da/pr53da-dss-0021/projects/2021__Cuckoo_Resequencing/tmp/kneaddata_db

mkdir trimmed_fastq

for SAMPLE in $(cat Samples.list); do 

echo "Working on sample: ${SAMPLE}"

kneaddata --input1 ${raw}/${SAMPLE}_R1.fastq.gz --input2 ${raw}/${SAMPLE}_R1.fastq.gz --reference-db $ref_db --output ${trimmed} --trimmomatic /dss/dsshome1/lxc07/di39dux/mambaforge/envs/microbes/share/trimmomatic-0.39-2/

done 

Delete any empty files fpr clean-up afterwards:

find . -type f -empty -delete

Re-Run QC

Now, re-run fastqc/multiqc on the trimmed and decontaminated reads:

Run fastqc:

# fastQC on all files
fastqc *kneaddata_paired_* --outdir=./fastqc_results

Compile with multiqc:

multiqc ./fastqc_results -o ./multiqc_report

I export the multiqc reports as svg and compile them in inkscape.

Fig1
Fig1

Fig. 1. Quality control from 9 paired-end metagenomic shotgun libraries, spanning 3 individuals, 2 biopsies, and 3 replicates. (a) MultiQC results indicating the number of unique and duplicate reads for each raw library, indicating problematic variability across individuals. (b) MultiQC results for reads post processing with kneaddata, indicating exacerbated read differences among libraries. (c) MultiQC heatmaps showing library quality metrics, indicating problematic sequence composition, GC content, and artificial sequences pre-QC. (d). Post-QC analysis indicates that overrepresented sequences and adapters are solved, but base composition and GC content are still problematic, putatively from library prep.

Reads After Filtering

I compile all the fastq up until this point into a new directory, and count the number of reads in each file using seqkit:

seqkit stats * | sed 's/,//g' > Read_Counts.txt

Script for grabbing $num_reads, $num_bases from each fastq. This is just a quick hacky way to assign ‘filter’ levels based on strings in each file:

import argparse
import pandas as pd

def process_file(input_file, output_file):
    # Load seqkit output 
    df = pd.read_csv(input_file, sep='\s+', engine='python')

    # Parse the first seqkit column to extract ID, Location, Replicate, and Read_pair
    df[['ID', 'Location', 'Replicate', 'Read_pair']] = df['file'].str.extract(r'(P\d+)_(Ileum|Rectum)_(\d+)_R(\d+)')

    # Based on strings, assign 'filter' levl 
    conditions = [
        (df['file'].str.contains('.gz')),
        (df['file'].str.contains('paired_contam')),
        (df['file'].str.contains('kneaddata_paired')),
        (df['file'].str.contains('repeats')),
        (df['file'].str.contains('trimmed')),
        (df['file'].str.contains('unmatched'))
    ]
    choices = ['Raw', 'Contaminants', 'Filtered', 'Repeatless', 'Trimmed', 'Dropped']
    df['Filter'] = pd.np.select(conditions, choices, default='Dropped')

    # Grab those columns 
    df = df[['ID', 'Location', 'Replicate', 'Filter', 'num_seqs', 'sum_len', 'avg_len', 'Read_pair']]

    # save tsv
    df.to_csv(output_file, sep='\t', index=False)

def main():
    parser = argparse.ArgumentParser(description="Process read count files.")
    parser.add_argument("input_file", help="Input file name")
    parser.add_argument("--out", dest="output_file", help="Output file name", required=True)

    args = parser.parse_args()

    process_file(args.input_file, args.output_file)

if __name__ == "__main__":
    main()

Simple as: python Summarize_Counts.py --out Read_Counts_Output.txt Read_Counts.txt

Summarize the read counts in R:

### plot QC pre and post trimming 
setwd('/dss/dsshome1/lxc07/di39dux/merondun/microbes/read_counts')
.libPaths('~/mambaforge/envs/microbes/lib/R/library')
library(tidyverse)
library(RColorBrewer)

###### Count Host vs Non-Host Reads ######

counts <- read_tsv('/dss/dsshome1/lxc07/di39dux/merondun/microbes/read_counts/Read_Counts_Output.txt')
counts <- counts %>% filter(Filter != 'Dropped') %>% arrange(ID,Location,Replicate,Filter)

#merge R1 and R2, e.g. double the counts and then remove redundant rows. For averages, retain only unique values
counts_summarized <- counts %>% 
  group_by(ID,Location,Replicate,Filter) %>% 
  summarize(
    num_seqs = sum(num_seqs),
    sum_len = sum(sum_len),
    avg_len = unique(avg_len)
  ) %>% distinct

# add factor so the filters are in descending order, ending with contaminants 
counts_summarized$Filter = factor(counts_summarized$Filter,levels=c('Raw','Trimmed','Repeatless','Filtered','Contaminants'))

plot = counts_summarized %>% 
  ggplot(aes(x=Replicate,y=num_seqs,fill=Filter))+
  geom_bar(stat='identity',position=position_dodge(width=0.9))+
  scale_fill_manual(values=brewer.pal(5,'Set2'))+
  facet_grid(Location~ID,scales='free')+
  theme_bw() + 
  ylab('Number of Sequences')
plot

png('Counts_By_Filters_2024MAY07.png',units='in',res=600,height=4,width=6)
plot
dev.off()

#what is the number of host reads vs passed-QC reads? e.g. contaminants / trimmed 
#From NatMeth paper: After bioinformatic host removal, the percentages of host reads were calculated by 
#dividing reads remaining after host filtering by the total reads that passed quality control.
# to me, this is actually the number of target microbiome reads, not contaminant reads? Contam should contain host (human) + common contaminants?

proportion_target = counts_summarized %>% filter(Filter == 'Filtered' | Filter == 'Trimmed') %>% 
  group_by(ID,Location,Replicate) %>% 
  summarize(Proportion_Target = num_seqs[Filter == 'Filtered'] / num_seqs[Filter == 'Trimmed']) %>% 
  ggplot(aes(x=Location,y=Proportion_Target,fill=as.factor(Replicate)))+
  geom_bar(stat='identity',position=position_dodge(width=0.9))+
  geom_text(aes(x=Location,y=Proportion_Target,label=paste0(round(Proportion_Target*100,1),'%')),
            position=position_dodge(width=0.9),vjust=-.5,size=1.5)+
  facet_grid(.~ID,scales='free')+
  ylab('Proportion of Target Reads')+
  scale_fill_manual(values=brewer.pal(3,'Dark2'))+
  theme_bw(base_size=8)
proportion_target

png('Proportion_Target_Reads_2024MAY07.png',units='in',res=600,height=2,width=6)
proportion_target
dev.off()

FigS1
FigS1

Fig. S1. Number of reads for each library across various QC stages after processing from knead data. Colors correspond to the number of reads from raw data, data after trimming, reads after removing repeats, final reads after removing host and contaminants, and host and contaminant reads.

And what proportion of reads are ‘target’? E.g. which reads pass filters and are NOT human or contaminants? (second plot from above script)

FigS2
FigS2

Fig. S2. Proportion of target reads within each library, summarized as the number of ‘Filtered’ reads which do not contain host or contaminant DNA, compared to the ‘Trimmed’ input DNA. Putatively individual P2 was successfully enriched.

Microbial Communities [Q2]

I’ll use metaphlan v4.1.0 for profiling.

I just subset the files so they are named nicely in a new directory:

for SAMPLE in $(cat Samples.list); do 

echo "Moving sample: ${SAMPLE}"

cp trimmed_fastq/${SAMPLE}_*kneaddata_paired_1.fastq species_profiling/${SAMPLE}_R1.fastq
cp trimmed_fastq/${SAMPLE}_*kneaddata_paired_2.fastq species_profiling/${SAMPLE}_R2.fastq

done 

So now we just have our filtered, de-contaminated reads here: /dss/dsshome1/lxc07/di39dux/merondun/microbes/species_profiling

ls -lhtr
total 52M
-rw-r--r-- 1 di39dux uj502 1.1M May  7 19:12 P1_Ileum_1_R1.fastq
-rw-r--r-- 1 di39dux uj502 1.1M May  7 19:12 P1_Ileum_1_R2.fastq
-rw-r--r-- 1 di39dux uj502 594K May  7 19:12 P1_Ileum_2_R1.fastq
-rw-r--r-- 1 di39dux uj502 594K May  7 19:12 P1_Ileum_2_R2.fastq
-rw-r--r-- 1 di39dux uj502 848K May  7 19:12 P1_Ileum_3_R1.fastq
-rw-r--r-- 1 di39dux uj502 848K May  7 19:12 P1_Ileum_3_R2.fastq
-rw-r--r-- 1 di39dux uj502 355K May  7 19:12 P1_Rectum_1_R1.fastq
-rw-r--r-- 1 di39dux uj502 355K May  7 19:12 P1_Rectum_1_R2.fastq
-rw-r--r-- 1 di39dux uj502 355K May  7 19:12 P1_Rectum_2_R1.fastq
-rw-r--r-- 1 di39dux uj502 355K May  7 19:12 P1_Rectum_2_R2.fastq
-rw-r--r-- 1 di39dux uj502 628K May  7 19:12 P1_Rectum_3_R1.fastq
-rw-r--r-- 1 di39dux uj502 628K May  7 19:12 P1_Rectum_3_R2.fastq
-rw-r--r-- 1 di39dux uj502 826K May  7 19:12 P2_Ileum_1_R1.fastq
-rw-r--r-- 1 di39dux uj502 826K May  7 19:12 P2_Ileum_1_R2.fastq
-rw-r--r-- 1 di39dux uj502 5.3M May  7 19:12 P2_Ileum_2_R1.fastq
-rw-r--r-- 1 di39dux uj502 5.3M May  7 19:12 P2_Ileum_2_R2.fastq
-rw-r--r-- 1 di39dux uj502 3.3M May  7 19:12 P2_Ileum_3_R1.fastq
-rw-r--r-- 1 di39dux uj502 3.3M May  7 19:12 P2_Ileum_3_R2.fastq
-rw-r--r-- 1 di39dux uj502 3.3M May  7 19:12 P2_Rectum_1_R1.fastq
-rw-r--r-- 1 di39dux uj502 3.3M May  7 19:12 P2_Rectum_1_R2.fastq
-rw-r--r-- 1 di39dux uj502 1.9M May  7 19:12 P2_Rectum_2_R1.fastq
-rw-r--r-- 1 di39dux uj502 1.9M May  7 19:12 P2_Rectum_2_R2.fastq
-rw-r--r-- 1 di39dux uj502 4.1M May  7 19:12 P2_Rectum_3_R1.fastq
-rw-r--r-- 1 di39dux uj502 4.1M May  7 19:12 P2_Rectum_3_R2.fastq
-rw-r--r-- 1 di39dux uj502 1.2M May  7 19:12 P3_Ileum_1_R1.fastq
-rw-r--r-- 1 di39dux uj502 1.2M May  7 19:12 P3_Ileum_1_R2.fastq
-rw-r--r-- 1 di39dux uj502 968K May  7 19:12 P3_Ileum_2_R1.fastq
-rw-r--r-- 1 di39dux uj502 968K May  7 19:12 P3_Ileum_2_R2.fastq
-rw-r--r-- 1 di39dux uj502 913K May  7 19:12 P3_Ileum_3_R1.fastq
-rw-r--r-- 1 di39dux uj502 913K May  7 19:12 P3_Ileum_3_R2.fastq
-rw-r--r-- 1 di39dux uj502 361K May  7 19:12 P3_Rectum_1_R1.fastq
-rw-r--r-- 1 di39dux uj502 361K May  7 19:12 P3_Rectum_1_R2.fastq
-rw-r--r-- 1 di39dux uj502 110K May  7 19:12 P3_Rectum_2_R1.fastq
-rw-r--r-- 1 di39dux uj502 110K May  7 19:12 P3_Rectum_2_R2.fastq
-rw-r--r-- 1 di39dux uj502 350K May  7 19:12 P3_Rectum_3_R1.fastq
-rw-r--r-- 1 di39dux uj502 350K May  7 19:12 P3_Rectum_3_R2.fastq

Metaphlan does not used paired reads, just provide both as comma separated as per here .

HOLY FRIGHT we need to download another 15gb database. Exploiting storage from one of my cuckoo projects because home is getting full:

metadb=/dss/dsslegfs01/pr53da/pr53da-dss-0021/projects/2021__Cuckoo_Resequencing/tmp/meta_db
metaphlan --install --bowtie2db ${metadb}

Metaphlan by tech replicate

And loop through samples:

#!/bin/bash

#SBATCH --get-user-env
#SBATCH --mail-user=merondun@bio.lmu.de
#SBATCH --clusters=serial
#SBATCH --partition=serial_std
#SBATCH --mem=20000mb
#SBATCH --cpus-per-task=10
#SBATCH --time=24:00:00

metadb=/dss/dsslegfs01/pr53da/pr53da-dss-0021/projects/2021__Cuckoo_Resequencing/tmp/meta_db

for SAMPLE in $(cat Samples.list); do 

metaphlan ${SAMPLE}_R1.fastq,${SAMPLE}_R2.fastq --bowtie2out profiles/${SAMPLE} --unclassified_estimation --bowtie2db ${metadb} --input_type fastq --nproc 10 > profiles/${SAMPLE}_profile.txt 

done

Well, seems like there’s no microbes, at least in the random sample I ran a test on.

WARNING: The number of reads in the sample (1350) is below the recommended minimum of 10,000 reads.
WARNING: MetaPhlAn did not detect any microbial taxa in the sample.

Let’s scale it up to all 18 libraries, and then merge the outputs:

merge_metaphlan_tables.py *_profile.txt > merged_abundance_table_techreps.txt

Create input for heatmap:

grep -E "s__|clade_name" merged_abundance_table_techreps.txt \
    | grep -v "t__" \
    | sed "s/^.*|//g" \
    > merged_abundance_table_species_techreps.txt

And plot heatmap. This uses correlation for features and Euclidean distance for samples (default):

hclust2.py \
    -i merged_abundance_table_species_techreps.txt \
    -o metaphlan4_abundance_heatmap_species_techreps.svg \
    --cell_aspect_ratio 2 \
    --flabel_size 10 --slabel_size 10 \
    --max_flabel_len 100 --max_slabel_len 100 \
    --dpi 300
FigS3
FigS3

Fig. S3. Abundance estimates of each species within each technical replicate using MetaPhLaN profiles, indicating detectable species largely within individual P2.

Metaphlan by bio replicate

What if we merge all the replicates for a sample? First impressions are that this dataset is way too tiny.

Strip last 2 characters (replicate) from the IDs:

cat Samples.list | sed 's/..$//' | sort | uniq > IDS.list
for ID in $(cat IDS.list); do 
cat ${ID}*_R1.fastq > merged_reads/${ID}_R1.concat.fastq
cat ${ID}*_R2.fastq > merged_reads/${ID}_R2.concat.fastq
done 

Now re-run on the concatenated fastq:

#!/bin/bash

#SBATCH --get-user-env
#SBATCH --mail-user=merondun@bio.lmu.de
#SBATCH --clusters=cm2_tiny
#SBATCH --partition=cm2_tiny
#SBATCH --cpus-per-task=5
#SBATCH --time=6:00:00

metadb=/dss/dsslegfs01/pr53da/pr53da-dss-0021/projects/2021__Cuckoo_Resequencing/tmp/meta_db

cd /dss/dsshome1/lxc07/di39dux/merondun/microbes/species_profiling/merged

for SAMPLE in $(cat Samples.list); do 

metaphlan ${SAMPLE}.concat.fastq --bowtie2out profiles/${SAMPLE} --unclassified_estimation --bowtie2db ${metadb} --input_type fastq --nproc 5 > profiles/${SAMPLE}_profile.txt 

done
WARNING: MetaPhlAn did not detect any microbial taxa in the sample.
WARNING: MetaPhlAn did not detect any microbial taxa in the sample.
WARNING: The metagenome profile contains clades that represent multiple species merged into a single representant.
An additional column listing the merged species is added to the MetaPhlAn output.
WARNING: The metagenome profile contains clades that represent multiple species merged into a single representant.
An additional column listing the merged species is added to the MetaPhlAn output.
WARNING: The metagenome profile contains clades that represent multiple species merged into a single representant.
An additional column listing the merged species is added to the MetaPhlAn output.
WARNING: The number of reads in the sample (6340) is below the recommended minimum of 10,000 reads.
WARNING: MetaPhlAn did not detect any microbial taxa in the sample.

Better…

Merge the outputs:

merge_metaphlan_tables.py *_profile.txt > merged_abundance_table.txt

Create input for heatmap:

grep -E "s__|clade_name" merged_abundance_table.txt \
    | grep -v "t__" \
    | sed "s/^.*|//g" \
    > merged_abundance_table_species.txt

And plot heatmap. This uses correlation for features and Euclidean distance for samples (default):

hclust2.py \
    -i merged_abundance_table_species.txt \
    -o metaphlan4_abundance_heatmap_species.svg \
    --cell_aspect_ratio 0.5 \
    --flabel_size 10 --slabel_size 10 \
    --max_flabel_len 100 --max_slabel_len 100 \
    --dpi 300
FigS4
FigS4

Fig. S4. Abundance estimates of each species within each biological replicate using MetaPhLaN profiles, indicating detectable species largely within individual P2. Species identified from both technical replicate and biological replicate analyses are indicated with a star.

Sanity, raw data

Also perform metaphlan on the raw data, to see if our trimming and decontaminating was overly conservative. Merge by biological replicate for expediency:

for ID in $(cat IDS.list); do 
zcat ${ID}*gz > merged/${ID}.concat_raw.fastq
done

Run again, on our annoying little tiny alternate cluster because the evolution folks are hogging our CPUs!

#!/bin/bash

#SBATCH --get-user-env
#SBATCH --mail-user=merondun@bio.lmu.de
#SBATCH --clusters=cm2_tiny
#SBATCH --partition=cm2_tiny
#SBATCH --cpus-per-task=5
#SBATCH --time=6:00:00

metadb=/dss/dsslegfs01/pr53da/pr53da-dss-0021/projects/2021__Cuckoo_Resequencing/tmp/meta_db

cd /dss/dsshome1/lxc07/di39dux/merondun/microbes/raw_fastq/merged

for ID in $(cat IDS.list); do

metaphlan ${ID}.concat_raw.fastq --bowtie2out profiles/${ID} --unclassified_estimation --bowtie2db ${metadb} --input_type fastq --nproc 5 > profiles/${ID}_profile.txt

done

Merge the outputs:

merge_metaphlan_tables.py *_profile.txt > merged_abundance_table_raw.txt

Create input for heatmap:

grep -E "s__|clade_name" merged_abundance_table_raw.txt \
    | grep -v "t__" \
    | sed "s/^.*|//g" \
    > merged_abundance_table_species_raw.txt

And plot heatmap. This uses correlation for features and Euclidean distance for samples (default):

hclust2.py \
    -i merged_abundance_table_species_raw.txt \
    -o metaphlan4_abundance_heatmap_species_raw.svg \
    --cell_aspect_ratio 0.5 \
    --flabel_size 10 --slabel_size 10 \
    --max_flabel_len 100 --max_slabel_len 100 \
    --dpi 300
FigS5
FigS5

Fig. S5. Abundance estimates of each species within each raw library using MetaPhLaN profiles, indicating detectable species largely within individual P2, but some species found within P1 and P3.

INTERESTING! Raw data detects additional species in P1 and P3, but I’m skeptical with my current rudimentary understanding of metagenomics filtering / decontaminating.

All results

So, this indicates that likely, cumulatively with the % target DNA, that P2 is the only library which was enriched for microbial sequences.

Fig2
Fig2

Fig. 2. Abundance estimates of each species within each raw library using MetaPhLaN profiles across raw data and filtered data for technical and biological replicates. (a) Abundance estimates using the raw unfiltered reads, but merging at the biological replicate level. (b) Estimates at the technical replicate level. (c) Estimates at the biological replicate level, with species identified in all 3 analyses indicated with an asterisk. The library P3_Ileum is indicated with a red asterisk to denote these species were not recovered at the technical replicate level.

Summary take-away: P3 exhibited P. vulgatus when the libraries were merged, but were not detected at the technical replicate level. This microbe was detected at high levels upon merging, but not at all before, making me skeptical of it.

Pathway: colibactin [Q3]

Okay, so colibactin is produced by enteric bacteria, which includes E. coli. Microbes which produce colibactin seem to have a nice helpful 54-kb gene cluster which is diagnostic, and seems to widely distributed among taxa.

First, I will grab the genomes for all potential species identified. Ideally I would use the fastq reads and assemble them into a MAG and annotate it, or use humann to identify genes directly.

Collect genomes

Hacky way to quickly grab genomes for our target species:

wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

Not pretty, much streamlined:

refseq_db=/dss/dsslegfs01/pr53da/pr53da-dss-0021/projects/2021__Cuckoo_Resequencing/tmp/assembly_summary_refseq.txt

for SPECIES in $(cat Species.list); do 

SP=$(echo $SPECIES | sed 's/_/ /g')


URL=$(grep "$SP" $refseq_db | tail -n 1 | sed 's/.*https/https/g' | sed 's/\t.*//g')
GCF=$(grep "$SP" $refseq_db | tail -n 1 | awk '{print $1}')
ASM=$(grep "$SP" $refseq_db | tail -n 1 | sed 's/.*ASM/ASM/g' | sed 's/\t.*//g')
SANITY=$(grep "$SP" $refseq_db | tail -n 1 | awk '{print $8, $9}')

echo "URL for species: ${SPECIES}, matches ${SANITY}, @ ${URL}, ACC: ${GCF}, ASM: ${ASM}" 

wget "${URL}/${GCF}_${ASM}_genomic.fna.gz"
mv "${GCF}_${ASM}_genomic.fna.gz" ${SPECIES}.fa.gz

done

Output:

URL for species: Akkermansia_muciniphila, matches Akkermansia muciniphila, @ https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/024/086/545/GCF_024086545.1_ASM2408654v1, ACC: GCF_024086545.1, ASM: ASM2408654v1
URL for species: Alistipes_onderdonkii, matches Alistipes onderdonkii, @ https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/037/347/475/GCF_037347475.1_ASM3734747v1, ACC: GCF_037347475.1, ASM: ASM3734747v1
URL for species: Alistipes_putredinis, matches Alistipes putredinis, @ https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/037/429/465/GCF_037429465.1_ASM3742946v1, ACC: GCF_037429465.1, ASM: ASM3742946v1
URL for species: Bacteroides_uniformis, matches Bacteroides uniformis, @ https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/471/545/GCF_003471545.1_ASM347154v1, ACC: GCF_003471545.1, ASM: ASM347154v1
URL for species: Dialister_invisus, matches Dialister invisus, @ https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/037/266/035/GCF_037266035.1_ASM3726603v1, ACC: GCF_037266035.1, ASM: ASM3726603v1
URL for species: Mediterraneibacter_faecis, matches Mediterraneibacter faecis, @ https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/024/463/495/GCF_024463495.1_ASM2446349v1, ACC: GCF_024463495.1, ASM: ASM2446349v1
URL for species: Oscillibacter_sp_ER4, matches , @ , ACC: , ASM:
URL for species: Phocaeicola_massiliensis, matches 204516 Phocaeicola, @ https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/382/445/GCF_000382445.1_Bact_mass_DSM_17679_V1, ACC: GCF_000382445.1, ASM: GCF_000382445.1
URL for species: Phocaeicola_vulgatus, matches Phocaeicola vulgatus, @ https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/024/463/175/GCF_024463175.1_ASM2446317v1, ACC: GCF_024463175.1, ASM: ASM2446317v1
URL for species: Ruminococcus_bromii, matches Ruminococcus bromii, @ https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/014/870/125/GCF_014870125.1_ASM1487012v1, ACC: GCF_014870125.1, ASM: ASM1487012v1
URL for species: Sphingomonas_echinoides, matches Sphingomonas echinoides, @ https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/029/623/395/GCF_029623395.1_ASM2962339v1, ACC: GCF_029623395.1, ASM: ASM2962339v1

Individually download Oscillibacter_sp_ER4 and Phocaeicola_massiliensis because they don’t match perfectly.

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/765/235/GCA_000765235.1_ASM76523v1/GCA_000765235.1_ASM76523v1_genomic.fna.gz
mv GCA_000765235.1_ASM76523v1_genomic.fna.gz Oscillibacter_sp_ER4.fa.gz

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/382/445/GCA_000382445.1_Bact_mass_DSM_17679_V1/GCA_000382445.1_Bact_mass_DSM_17679_V1_genomic.fna.gz
mv GCA_000382445.1_Bact_mass_DSM_17679_V1_genomic.fna.gz Phocaeicola_massiliensis.fa.gz

Blast clb Primers

For a quick and dirty job, I’ll just grab the primer sequences from the clb genes from E. coli from page S18 from this paper.

Corresponds to these primers:

Primer Sequence
clbA-F TTTAGGGGTGATGAGTGGAGAGGCT
clbA-R TCATCAAACCAGTAGAGATAACTTCCTTCACT
clbB-F TGTTCCGTTTTGTGTGGTTTCAGCG
clbB-R GTGCGCTGACCATTGAAGATTTCCG
clbC-F TTGACGGAGGCGTTCGATACTTCAC
clbC-R ACTTGTATCACTCGGCGGCAATCAA
clbD-F CGGAGAATGTAGTCGGCGTCCATTT
clbD-R CCCTGATTTCACGCCCAAATACCCT
clbF-F CGATTGCCCTCACAGAGCCGAATAT
clbF-R AATGCCATGAGAAAATAACCGCCGC
clbG-F CGAATATGCTGCGCTGACCTGTAGT
clbG-R GATAGCGATTCTCCAGCAGCAGGTT
clbH-F CTTTGTCGAGTTGCCGGAATACCCT
clbH-R TGTGTCTGATCTCCTGTGGTCCCTT
clbI-F TTGAGAATGTACGACTGAACCCGCC
clbI-R AATGAATGTCCGCCAGCTTCGAAGA
clbJ-F TGGCCTGTATTGAAAGAGCACCGTT
clbJ-R AATGGGAACGGTTGATGACGATGCT
clbK-F TTGATGATCACCACGCCAGCTTCTT
clbK-R GCGGATGGCGGTAGTGATAAGCTAG
clbL-F CACAGGTGTCTATGCCCATCGTTGT
clbL-R GCCGACCACTGAGTTTGACTGCTAT
clbM-F TGTTTCAAGGCGCGGGTAAGATCAT
clbM-R TAGTCACTCACGGCAACAACACGAG
clbN-F GGCATTCAGTTCGGGTATGTGTGGA
clbN-R AACAGAGCTGCCGTAAAGACTCGAC
clbO-F AAGGAGGTGCGGTAAATAACGACGG
clbO-R CGGTGGCATGGATCCTTTTCGTTTG
clbP-F CTTGCCGCAGACAATCGTATCCTCT
clbP-R CCTGGAGATAGTATACCCGGTGCGA
clbQ-F CTGTGTCTTACGATGGTGGATGCCG
clbQ-R GCATTACCAGATTGTCAGCATCGCC

Make a blast db for each genome:

for SPECIES in $(cat Species.list); do 

echo "Creating Blast DB... and blasting on ${SPECIES}"
#makeblastdb -in fasta/${SPECIES}.fa -parse_seqids -dbtype nucl -out blastdbs/${SPECIES}

blastn -query Primers.fa -db blastdbs/${SPECIES} -evalue 0.05 \
    -word_size 15 \
    -max_hsps 1 -max_target_seqs 1 \
    -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore sstrand' > blasted/${SPECIES}.blast

done 

E-value and word size needs to be balanced here, obviously subjective, but I tried to find a balance where we identified zero hits in some species and few in others. This would provide the most probable real hits with these short sequences.

Table S1. BLASTn results from the clb gene clusters against each microbial genome. Species in bold were also found to harbor a NRPS gene cluster using a separate antismash analysis.

Species Gene Percent_Identity Length e-value
Alistipes_putredinis clbN-F 100 15 0.04
Bacteroides_uniformis clbF-R 90.909 22 0.023
Dialister_invisus clbI-R 100 17 0.003
Oscillibacter_sp_ER4 clbB-F 100 16 0.014
Phocaeicola_massiliensis clbF-R 100 16 0.022
Phocaeicola_massiliensis clbO-F 95 20 0.006
Ruminococcus_bromii clbJ-R 100 15 0.034
Ruminococcus_bromii clbN-F 100 15 0.034

Antismash to find pathway

Antismash requires a lot of different dependencies, so I make a new conda env

smash_db=/dss/dsslegfs01/pr53da/pr53da-dss-0021/projects/2021__Cuckoo_Resequencing/tmp/antismash

download-antismash-databases --database-dir ${smash_db}

Antismash it

smash_db=/dss/dsslegfs01/pr53da/pr53da-dss-0021/projects/2021__Cuckoo_Resequencing/tmp/antismash


for SPECIES in $(cat Species.list); do 

echo "Antismashing species: ${SPECIES}"
mkdir antismash/${SPECIES}

antismash --output-dir antismash/${SPECIES} --genefinding-tool prodigal fasta/${SPECIES}.fa --databases ${smash_db}
cp antismash/${SPECIES}/${SPECIES}.zip antismash/

done

I then inspect the index.html files and identify which gene pathway clusters are identified in each species. I tabulate this manually for now since the species are limited, and visualize it in R. Two species have clusters corresponding to NRPS , which is directly involved with producing colibactin.

### plot colibactin candidates
setwd('/dss/dsshome1/lxc07/di39dux/merondun/microbes/genomes/')
.libPaths('~/mambaforge/envs/microbes/lib/R/library')
library(tidyverse)
library(RColorBrewer)

###### Summarize Pathways ######

paths <- read_tsv('/dss/dsshome1/lxc07/di39dux/merondun/microbes/genomes/antismash_gene_clusters.txt')

pathplot = paths %>% 
  ggplot(aes(y=Gene_Clusters,col=Colibactin_Pathway,x=Species))+
  geom_point()+
  theme_bw()+
  theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1))

pdf('Pathway_Plot_2024MAY08.pdf',height=3,width=6)
pathplot
dev.off()

Fig3
Fig3

Fig. 3. Candidate species capable of producing colibactin. Ruminococcus bromii is out top candidate, identified from both BLAST and antismash approaches and was identified from the biological replicate profile of filtered reads from MetaPhlAn.

Classify with humann

We could use humann to classify our decontaminated reads:

humann --version
humann v3.9

Merge the R1 and R2:

for ID in $(cat IDS.list); do cat ${ID}*fastq > ${ID}.bothreads_concat.fastq; done

And run humann, this is fast so I just run it directly. If I were to do this for real…I would download the full chocoplhan database, but it’s 15GB:

humann_databases
HUMAnN Databases ( database : build = location )
chocophlan : full = http://huttenhower.sph.harvard.edu/humann_data/chocophlan/full_chocophlan.v201901_v31.tar.gz
chocophlan : DEMO = http://huttenhower.sph.harvard.edu/humann_data/chocophlan/DEMO_chocophlan.v201901_v31.tar.gz
uniref : uniref50_diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref_annotated/uniref50_annotated_v201901b_full.tar.gz
uniref : uniref90_diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref_annotated/uniref90_annotated_v201901b_full.tar.gz
uniref : uniref50_ec_filtered_diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref_ec_filtered/uniref50_ec_filtered_201901b_subset.tar.gz
uniref : uniref90_ec_filtered_diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref_ec_filtered/uniref90_ec_filtered_201901b_subset.tar.gz
uniref : DEMO_diamond = http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref_annotated/uniref90_DEMO_diamond_v201901b.tar.gz
utility_mapping : full = http://huttenhower.sph.harvard.edu/humann_data/full_mapping_v201901b.tar.gz

So this won’t do anything useful. But this script works with the demo, but I would need the db.

#!/bin/bash

#SBATCH --get-user-env
#SBATCH --mail-user=merondun@bio.lmu.de
#SBATCH --clusters=cm2_tiny
#SBATCH --partition=cm2_tiny
#SBATCH --cpus-per-task=2
#SBATCH --time=6:00:00

metadb=/dss/dsslegfs01/pr53da/pr53da-dss-0021/projects/2021__Cuckoo_Resequencing/tmp/meta_db
humann_db=need_to_dl

cd /dss/dsshome1/lxc07/di39dux/merondun/microbes/species_profiling/merged_reads

for ID in $(cat IDS.list); do

echo "Working on sample: ${ID}"

humann --input ${ID}.bothreads_concat.fastq --taxonomic-profile profiles/${ID}_profile.txt --nucleotide-database ${humann_db} --output humann

done

Normalize according to relative abundance, as appropriate for MaAsLin, which looks interesting for follow-up work.

for ID in $(cat IDS.list); do 

humann_renorm_table --input humann/${ID}.bothreads_concat_genefamilies.tsv --output humann/${ID}.genefamilies_relab.tsv --units relab

done

Fork 2: QIIME2

Alternatively, we could analyze with QIIME2, which is potentially more appropriate for extremely low input reads we have here.

If I were to re-start this project, I would probably just go with QIIME2 instead since it is probably more appropriate for a whole workflow of QC > community ID via OTUs > maybe pathway analysis? Oh well, maybe next time.